NBA MVP Analysis

a) Introduction to the research space

Aim:

Who should be the Most Valuable Player (MVP) of the 2021-2022 NBA season, other than Nikola Jokic? For simplicity, the abbreviation MVP is used throughout for easy reading.

Dataset:

The data sources I am using are from the NBA website; the additional data source under consideration is from Kaggle.

Summary:

When it comes to awards, especially in sports, there are many questions and debates as to who truly deserves them. The choice has only been more or less unanimous for players such as Kareem Abdul-Jabbar or Michael Jordan. In the NBA, this debate leads to a lot of controversy, and it extends to other awards as well, such as Defensive Player of the Year and Sixth Man of the Year. In this research, I will analyse whether Nikola Jokic should be the MVP for the 2021-2022 NBA season (regular season only, not including the Playoffs).

The candidates for the Regular Season MVP are listed in this article: https://www.nba.com/news/kia-mvp-ladder-april-15-edition

b) Data is relevant to project aims/objectives and use of data source is clearly justified

First Data Source:

Origin of data: The NBA traditional stats page. It will be scraped and used to compare the leading candidates for the NBA MVP award.

Source: https://www.nba.com/stats/players/traditional/?sort=PTS&dir=-1&Season=2021-22&SeasonType=Regular%20Season

Format: HTML table

This dataset is the main focus of the project and has all the information I need to compare the candidates. It consists of a table of every player in the 2021-2022 NBA season, sorted by points per game. Sorting by points per game is preferable because it places most of the MVP candidates within the top 15. I will retrieve the data from the website by web scraping.

Second Data Source:

Origin of data: The NBA Teams traditional stats page. I will scrape this to show how each player affects his team in terms of winning.

Source: https://www.nba.com/stats/teams/traditional/?sort=W_PCT&dir=1&Season=2021-22&SeasonType=Regular%20Season

Format: HTML table

This dataset provides information on the teams, including their winning percentages and plus/minus. I will be getting this from the NBA website as well.

Consideration for Data Source:

Origin of data: I found another dataset on Kaggle with similar statistics that focuses mainly on Kobe Bryant, Michael Jordan and LeBron James. This Kaggle dataset is being considered because these 3 players are considered to be the greatest who have ever played the game. The strength of this data source is that it gives me the opportunity to compare the current MVP to the greats of the game, especially players who won multiple MVP awards.

Source: https://www.kaggle.com/datasets/xvivancos/michael-jordan-kobe-bryant-and-lebron-james-stats?select=advanced_stats.csv

Format: CSV

c) Project background is clearly defined (e.g. use of literature, research or pre-analysis)

Field of interest/relevance

I have followed the NBA for about 5 years now, and personally I feel some players were snubbed for the MVP award. For example, I think the 2019-2020 NBA season MVP should have been LeBron James. Giannis was great in the 2019-2020 season, but his team was eliminated in the playoffs while the Lakers dominated on their way to the championship. I personally think the 2021-2022 NBA MVP should be Stephen Curry, but we shall see as I analyse the data.

Has the topic been explored and/or research questions have not already been answered

MVP topics have been discussed before, but not for a specific season. The research usually revolves around popular NBA legends like Michael Jordan, or around which team is the best and who its best players are. I found one analysis of players' age and how well they play in each NBA season, which I thought was very interesting. There are also discussions comparing current players to NBA legends, to determine whether a particular player is a legend in the making. Other than these sources, I have not found any analysis of the current season and its MVP.

Scope of work

I will analyse the variables and provide a glossary of the terms used in the NBA for clarity.

Steps and stages for analytical data processing pipeline

The steps are as follows:

1) I will get the data by web scraping the NBA website.

2) Next, I will keep only the MVP candidates and their respective teams, and compare how well they do. I will remove some columns from the tables, such as fantasy points and triple doubles.

3) Finally, I will analyse the data by plotting graphs.

Description of how I will evaluate my aims and objectives based on my approach

I hope to do so by analysing the stats of the 15 MVP candidates; the one who best leads both himself and his team to victory should be the MVP.

Glossary

Term              Explanation
GP                Games Played
W                 Wins
W_PCT             Winning Percentage
REB               Rebounds
AST               Assists
TOV               Turnovers
PTS               Points
W_PCT_RANK        Winning Percentage Ranking
PLUS_MINUS_RANK   Plus Minus Ranking

d) Dataset explored technically

Importing tools and packages

Reading URLs and scraping

These are the URLs for all the NBA players and all the NBA teams.

I learnt this way of extracting the NBA stats from YouTube. I am crediting the link here, and it also appears in the reference list at the end.

https://github.com/rd11490/NBA_Tutorials/tree/master/finding_endpoints

I then save each request's response to a variable so I can access the JSON data. I made one each for players and teams.

The columns list holds the names of all the columns. This is what determines the headers for the stats.

Next, I make the dataframe for the NBA players, using the columns list above as headers. After making and saving the dataframe, I sample 5 rows as an example.

I do the same for the teams: I make the columns list and the dataframe, then sample 5 rows as an example.
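The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's original code: it assumes the response JSON holds a `resultSets` list whose first entry has `headers` and `rowSet` keys, and it uses a tiny made-up payload instead of a live request. The helper name `payload_to_frame` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical, heavily shortened payload in the shape the stats endpoint
# is assumed to return. In the notebook this would come from the saved
# response, e.g. response.json().
sample_payload = {
    "resultSets": [
        {
            "headers": ["PLAYER_NAME", "TEAM_ID", "GP", "PTS"],
            "rowSet": [
                ["Player A", 1, 70, 27.1],
                ["Player B", 2, 65, 15.0],
            ],
        }
    ]
}


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Build a dataframe from the first result set of a stats payload."""
    result = payload["resultSets"][0]
    columns = result["headers"]  # the "columns list" used as table headers
    rows = result["rowSet"]      # one inner list per player (or team)
    return pd.DataFrame(rows, columns=columns)


players_df = payload_to_frame(sample_payload)

# Show up to 5 random rows as a quick look at the result.
print(players_df.sample(n=min(5, len(players_df))))
```

The same helper could be reused for the teams response, since only the headers and rows differ.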

Data Cleaning and Processing

Process

I noticed there were a lot of columns I do not really need, such as double doubles and fantasy points. Some of the columns I dropped are debatable, such as MIN (Minutes) and FT_PCT (Free Throw Percentage). To explain my reasoning with one example: I dropped the MIN column because the chosen players are considered the top of the game and are expected to play more minutes than other players, so the column adds little when comparing them. Points is a different case, because the most valuable player is usually one who scores more and lifts the team.

I tried to keep the headers neat and only kept the stats that most people would check, such as points, turnovers and winning percentage.
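As an illustration of this step, the sketch below drops a handful of columns from a made-up one-row table. The column names `NBA_FANTASY_PTS` and `DD2` are assumptions about how fantasy points and double doubles are labelled; the real scraped table may use different names.

```python
import pandas as pd

# Made-up row; only the column layout matters for this sketch.
players_df = pd.DataFrame(
    {
        "PLAYER_NAME": ["Player A"],
        "PTS": [27.1],
        "TOV": [3.1],
        "W_PCT": [0.6],
        "MIN": [33.5],
        "FT_PCT": [0.81],
        "NBA_FANTASY_PTS": [50.2],  # assumed column name
        "DD2": [20],                # assumed column name
    }
)

# Columns judged unnecessary for the MVP comparison.
unwanted = ["MIN", "FT_PCT", "NBA_FANTASY_PTS", "DD2"]

# errors="ignore" skips any listed name that is not in the table.
players_clean = players_df.drop(columns=unwanted, errors="ignore")
print(players_clean.columns.tolist())
```

Keeping the unwanted names in one list makes it easy to revisit the debatable choices, such as MIN and FT_PCT, later.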

Checking scraped data

After cleaning the data, I check the scraped data to confirm that nothing is outside the expected bounds. The players dataframe looks fine. PLUS_MINUS is allowed to be negative, because players can sometimes have a negative impact on the court too. TEAM_ID is acceptable because I will not use TEAM_ID in the analysis; it is only there for the convenience of merging the two dataframes.

I also check the scraped data for the NBA teams to confirm that nothing is outside the expected bounds. Again, TEAM_ID is acceptable because it is only used for merging the two dataframes.
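One possible way to make these checks explicit is shown below. The bounds are my own assumptions about what "expected" means here (winning percentage between 0 and 1, wins not exceeding games played, no missing values), and the two rows are made up.

```python
import pandas as pd

players_df = pd.DataFrame(
    {
        "PLAYER_NAME": ["Player A", "Player B"],
        "GP": [70, 56],
        "W": [42, 25],
        "W_PCT": [0.6, 0.446],
        "PLUS_MINUS": [5.5, -2.1],
    }
)

# Summary statistics give a quick view of minimum and maximum values.
print(players_df.describe())

# Explicit checks for the bounds assumed in this sketch.
checks = {
    "W_PCT between 0 and 1": players_df["W_PCT"].between(0, 1).all(),
    "GP is positive": (players_df["GP"] > 0).all(),
    "W does not exceed GP": (players_df["W"] <= players_df["GP"]).all(),
    "no missing values": not players_df.isna().any().any(),
}
# PLUS_MINUS is deliberately left unchecked: negative values are valid.

failed = [name for name, ok in checks.items() if not ok]
print("failed checks:", failed)
```

An empty list of failed checks corresponds to the "nothing outside expected bounds" outcome described above.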

Cleaning the NBA Players Table

Part 1

This was the tricky part of the project. I first sort the players by PTS, because the majority of the MVP candidates are in the top 15. However, one of the MVP candidates is not in the top 15, or even the top 50, for PTS: Chris Paul, who was sitting at rank 97 for points.

I first worked out how to slice by index, and then reset the index so I could use it to slice out and remove the players between Karl-Anthony Towns and Chris Paul. It took me a few tries, as I was stuck figuring out why the slice did not end where I wanted.

Finally, to confirm I was on the right track, I checked the dataframe and was glad to see I got what I wanted.
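A small sketch of this kind of index slicing is given below, using six made-up players in place of the real table. Here "P4" stands in for the last player kept from the top of the ranking and "P5" for the lower-ranked candidate; the helper `row_label` is hypothetical. One common source of confusion, which may or may not be what happened in the notebook, is that positional slices exclude their end while `.loc` label slices include it.

```python
import pandas as pd

players_df = pd.DataFrame(
    {
        "PLAYER_NAME": ["P1", "P2", "P3", "P4", "P5", "P6"],
        "PTS": [15.0, 30.6, 9.0, 24.6, 12.5, 27.1],
    }
)

# Sort by points, then rebuild the index so labels run 0..n-1 in rank order.
ranked = players_df.sort_values("PTS", ascending=False).reset_index(drop=True)


def row_label(frame: pd.DataFrame, name: str) -> int:
    """Index label of the first row whose PLAYER_NAME equals name."""
    return int(frame.loc[frame["PLAYER_NAME"] == name].index[0])


last_top = row_label(ranked, "P4")  # last player kept from the top group
wanted = row_label(ranked, "P5")    # lower-ranked candidate to keep

# Positional slice of the index: start included, end excluded.
between = ranked.index[last_top + 1 : wanted]

# Drop the rows in between, then keep everything up to the wanted player.
# A .loc label slice includes its end label.
candidates = ranked.drop(index=between).loc[:wanted].reset_index(drop=True)
print(candidates)
```

Resetting the index before slicing matters because, after sorting, the original labels no longer match the rank order.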

Part 2

Kyrie Irving is not considered an MVP candidate, so I removed his row and reset the index. This concludes my steps for getting the players table done.

Cleaning the NBA Teams table

Cleaning the teams table is much easier than cleaning the players table.

I thought about keeping the teams that do not have an MVP candidate, but ultimately removed them from the table and reset the index. I sort the teams by W_PCT (Winning Percentage), as it feels more accurate.

I did not sort by W_PCT_RANK because three teams share the same position. For PLUS_MINUS_RANK there seem to be gaps, but I think this is because the ranks were assigned before the other teams were removed.
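The team cleaning can be sketched as below with five made-up teams. The set `candidate_team_ids` is hypothetical, and breaking ties in W_PCT by PLUS_MINUS_RANK is my own assumption, added only so the order is deterministic.

```python
import pandas as pd

teams_df = pd.DataFrame(
    {
        "TEAM_ID": [1, 2, 3, 4, 5],
        "TEAM_NAME": ["Team A", "Team B", "Team C", "Team D", "Team E"],
        "W_PCT": [0.78, 0.646, 0.646, 0.5, 0.402],
        "W_PCT_RANK": [1, 2, 2, 4, 5],
        "PLUS_MINUS_RANK": [1, 3, 2, 4, 5],
    }
)

# Hypothetical IDs of the teams that have an MVP candidate.
candidate_team_ids = {1, 2, 3, 5}

# Keep only those teams, then sort by winning percentage (highest first).
kept = teams_df[teams_df["TEAM_ID"].isin(candidate_team_ids)]
kept = kept.sort_values(
    ["W_PCT", "PLUS_MINUS_RANK"], ascending=[False, True]
).reset_index(drop=True)
print(kept)

# Ranks were computed over the whole league, so ties repeat a rank and
# removed teams leave gaps in the sequence.
print(kept["W_PCT_RANK"].tolist())
```

In this toy table, rank 2 appears twice and ranks 3 and 4 are missing, which mirrors the ties and gaps noted above.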

Dataset prepared

Data set has been processed to remove illegal values

There were no illegal values in this dataset. I kept it as it is, and it is not too confusing.

Data is in the correct format for analysis e.g. numpy nd array, dataframe, with a clear distinction as to why this format is correct and appropriate.

Data are in the dataframe format that is easy to use for graphs/charts/plots

Checks have been done for out of bound values or for numeric and categorical quantities

Key variables such as 'PLUS_MINUS' for players are allowed to have negative values, and all other key variables do not have any out-of-bound values. Data is not sorted.

Depth of exploration draws out some interesting or valuable insights

Exploration shown below

Data is in an appropriate format to carry out further analysis

Dataframe is able to generate charts, graphs, plots, easily with help from plotly or matplotlib etc.

Data Analysis

Time to analyse the dataset. I will be using Plotly for the graphs, and I credit the link in the reference list at the end.

If the graphs do not run or show when opened, please go to File -> Trust Notebook, usually this step allows the graphs to be loaded. If not, please restart and run it again.

Scatter Matrix, Among NBA players (PTS, REB, AST, TOV)

I used a scatter matrix to display the 4 columns PTS, REB, AST and TOV. I originally wanted a bar chart with the columns stacked together, but I felt it might be too cluttered, so I decided to go for a scatter matrix.
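A scatter matrix of these four columns could be built along the following lines. This is a sketch rather than the notebook's code: the helper names and the three made-up players are hypothetical, and the Plotly import sits inside the plotting function so the data-preparation helper works even where Plotly is not installed.

```python
import pandas as pd

STAT_COLUMNS = ["PTS", "REB", "AST", "TOV"]


def select_stats(players: pd.DataFrame) -> pd.DataFrame:
    """Keep the player name plus the four per-game stats being compared."""
    return players[["PLAYER_NAME", *STAT_COLUMNS]].copy()


def build_scatter_matrix(players: pd.DataFrame):
    """Return a Plotly scatter matrix with one colour per player."""
    # Imported here so select_stats can be used without Plotly installed.
    import plotly.express as px

    return px.scatter_matrix(
        select_stats(players),
        dimensions=STAT_COLUMNS,
        color="PLAYER_NAME",
        title="MVP candidates: PTS, REB, AST, TOV",
    )


# Made-up values for illustration only.
players_df = pd.DataFrame(
    {
        "PLAYER_NAME": ["Player A", "Player B", "Player C"],
        "GP": [70, 65, 74],
        "PTS": [27.1, 15.0, 30.6],
        "REB": [13.8, 4.4, 11.7],
        "AST": [7.9, 10.8, 4.2],
        "TOV": [3.8, 2.4, 3.1],
    }
)
stats = select_stats(players_df)
print(stats)

# In a notebook, the figure would then be displayed with:
# build_scatter_matrix(players_df).show()
```

Each off-diagonal panel plots one pair of stats, so four columns yield twelve pairwise views without stacking bars.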

The first analysis compares the players on points, rebounds, assists and turnovers. What usually stands out for MVPs is their ability to knock down shots (points), make their team better (assists and rebounds) and avoid turning the ball over. We can see from the scatter matrix below that Joel Embiid scored the most points, Nikola Jokic secured the most rebounds, Chris Paul dished out the most assists, and DeMar DeRozan and Devin Booker had the fewest turnovers.

For points, the majority scored more than 25 points per game (ppg); the only candidate who did not is Chris Paul, who averaged 14.7 ppg. Three players combine a lot of points and rebounds: Giannis Antetokounmpo, Joel Embiid and Nikola Jokic. For points and assists, four players average more than 7 assists: Nikola Jokic, Trae Young, Luka Doncic and Chris Paul. Finally, multiple players have quite a lot of turnovers. This is usually because opposing defenses focus on stopping these players, since those who average a lot of points per game are usually the ones who can change the momentum of the game. The player with a lot of points but the most turnovers is Luka Doncic; the players with a lot of points but fewer turnovers are DeMar DeRozan and Devin Booker; and Chris Paul has fewer points but low turnovers.

To conclude this part of the analysis, Nikola Jokic makes a strong case, and it is understandable why he won the MVP for the 2021-2022 NBA season. He is among the top few in points, rebounds and assists; he is truly a center doing it all on the court. However, he does average 3 to 4 turnovers, and if he reduced them he would be even harder for opposing teams to deal with. Jokic seems to be in the lead for now, in this part of the analysis.

Treemap, Player Wins (GP, W, W_PCT, PLUS_MINUS)

The treemap below shows the number of games played and won by each player, as well as the winning percentage and plus/minus. Plus/minus measures the positive or negative impact the player has on the team. I chose to show the winning percentage and plus/minus in the treemap itself; when you hover over a player's box, you can see the games played and won.

From this analysis, we can see that Devin Booker has the most wins, the highest winning percentage and the fourth-highest plus/minus of the group. Trailing behind him is his teammate Chris Paul. LeBron James has the lowest winning percentage and a negative plus/minus, which means he did not have a positive impact on his team as a whole despite averaging 30.3 ppg. This is very surprising; I did not expect LeBron James to be this low in the Player Wins analysis.

To conclude this part of the analysis, Devin Booker far exceeds the majority of the other candidates. He is only slightly above Chris Paul, but that is because they are teammates. Nikola Jokic is not too far behind, with a winning percentage of 0.622 and a positive plus/minus of 6. He played more games than Booker and won 62.1% of the games he played (calculated by dividing the games won by the games played, then multiplying by 100). Nevertheless, Booker's winning percentage is 20.3 percentage points higher than Jokic's. Booker is also 0.9 percentage points ahead of Chris Paul, though this is because Chris Paul played fewer games than Booker.
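The percentage calculation described above (games won divided by games played, times 100) can be written out as a short sketch. The records used here are hypothetical and are not taken from the scraped tables.

```python
def win_pct(wins: int, games_played: int) -> float:
    """Share of games won, as a percentage of games played."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return wins / games_played * 100


# Hypothetical records, for illustration only.
player_a = win_pct(wins=45, games_played=60)
player_b = win_pct(wins=36, games_played=72)

# The gap between two percentages is measured in percentage points.
gap_points = player_a - player_b
print(f"{player_a:.1f}% vs {player_b:.1f}%: gap of {gap_points:.1f} percentage points")
```

Note that a player with fewer games played can have a higher percentage even with fewer wins, which is why games played and wins are shown alongside the percentage.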

Team Wins

3D Scatter (W_PCT, W_PCT_RANK, PLUS_MINUS_RANK)

I chose a 3D scatter because it covers all the columns for the teams.

Moving to team wins, the team with the highest winning percentage, and the top plus/minus rank and winning percentage rank, is the Phoenix Suns. The Suns dominated this analysis, being number 1 in everything. It is truly impressive that they were first in every column and had a winning percentage of at least 0.7 while the majority were around 0.6. Right behind the Suns were the Grizzlies and the Warriors. The team with the fewest wins and lowest plus/minus is the Los Angeles Lakers. This shows that, despite having a player averaging 30.3 ppg, team effort is important.

Nikola Jokic plays for the Denver Nuggets, who are right in the middle of this plot, with a slightly better result than the Minnesota Timberwolves.

In conclusion for this analysis, the Suns win this, which means teammates Devin Booker and Chris Paul have contributed a lot to the team.

Merging dataframes for further analysis

For the final part of the analysis, I merge the dataframes on TEAM_ID. After doing so, I analyse and compare the winning columns for players and teams. This lets me look in more depth at the connection between players and teams: are the MVP candidates part of the reason for their team's success?

I tried to use the index as the merge key, but it did not work at all.
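A minimal sketch of a merge on TEAM_ID is shown below with made-up rows. It also illustrates a likely reason the index did not work as a key: after cleaning, the row positions in the two tables are unrelated (different lengths and orders), whereas TEAM_ID identifies the same team in both. Because both tables have a W_PCT column, pandas adds its default `_x` and `_y` suffixes.

```python
import pandas as pd

# Made-up players; two of them share a team.
players_df = pd.DataFrame(
    {
        "PLAYER_NAME": ["Player A", "Player B", "Player C"],
        "TEAM_ID": [10, 20, 10],
        "W_PCT": [0.8, 0.6, 0.75],
    }
)

# Made-up teams, deliberately in a different order from the players.
teams_df = pd.DataFrame(
    {
        "TEAM_ID": [20, 10],
        "TEAM_NAME": ["Team Y", "Team X"],
        "W_PCT": [0.585, 0.78],
        "W_PCT_RANK": [6, 1],
    }
)

# A left merge keeps every player row and attaches the matching team row.
merged = players_df.merge(teams_df, on="TEAM_ID", how="left")

# W_PCT_x is the player's winning percentage, W_PCT_y the team's.
print(merged[["PLAYER_NAME", "TEAM_NAME", "W_PCT_x", "W_PCT_y"]])
```

If clearer names are preferred, `merge` also accepts a `suffixes` argument, for example `suffixes=("_player", "_team")`.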

Player-Team Wins

3D Scatter (W_PCT_x, W_PCT_RANK, W_PCT_y, PLUS_MINUS_RANK)

I chose a 3D scatter because it covers all the winning columns for players and teams. I do not need the number of games won, only the winning percentage. The player's winning percentage (W_PCT_x here, because the players and teams dataframes share the column name W_PCT) lets me see whether the player was key to most of the wins.

For this final part of the analysis, it is clear that Devin Booker and Chris Paul are very important to the success of the Phoenix Suns. Devin Booker was slightly better than Chris Paul in terms of winning percentage (their points are very small and near the top of the plot; they are small because the marker size is set by PLUS_MINUS_RANK). I tried to make the markers bigger, but their points then overlap, which makes them even harder to see.
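One possible workaround for the small markers, offered only as a suggestion and not something done in the original analysis, is to invert the rank before using it as a size so that the best-ranked team gets the largest marker. The column name `MARKER_SIZE` and the values below are hypothetical.

```python
import pandas as pd

merged = pd.DataFrame(
    {
        "PLAYER_NAME": ["Player A", "Player B", "Player C"],
        "PLUS_MINUS_RANK": [1, 2, 25],
    }
)

# Rank 1 is best but is the smallest number, so it draws the smallest
# marker. Inverting the rank gives the best rank the largest size.
worst_rank = merged["PLUS_MINUS_RANK"].max()
merged["MARKER_SIZE"] = worst_rank + 1 - merged["PLUS_MINUS_RANK"]
print(merged)
```

The new column could then be passed to the 3D scatter as `size="MARKER_SIZE"`, with Plotly Express's `size_max` argument capping the largest marker to limit overlap.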

Jokic (the actual MVP of the 2021-2022 NBA season) helped the Nuggets to the 6th seed in the Western Conference, and in this plot lands in the middle of the teams with MVP candidates. I can safely say Booker was definitely valuable to this Suns team, with a great winning percentage for both player and team.

In conclusion for this analysis, I give Devin Booker the nod, followed by Chris Paul and Stephen Curry. I thought Jokic would be higher, but I guess not.

Conclusion

From my analysis above, I conclude that Devin Booker should be the MVP for the 2021-2022 NBA season. In terms of contribution, he is indeed vital and is the leading scorer on this Suns roster. He helped the Suns finish first in the NBA and first in the Western Conference.

Before this analysis, I personally thought Stephen Curry would come out as the MVP, but the analysis changed my perspective a lot. I never knew Booker had such a winning impact on the Suns, and he deserves more credit for it. I hope he wins an MVP one day. I am not saying Nikola Jokic does not deserve the MVP, as he did carry his team to the playoffs, even though his team was eliminated in the first round.

e) Ethics of use of data

Description of where data is from

The first and second datasets are from the official NBA website. Here are the site's terms of use.

Source: https://www.nba.com/termsofuse

Clause 9, Part 1:

By using such NBA Statistics, you agree that: (1) any use, display or publication of the NBA Statistics shall include a prominent attribution to NBA.com in connection with such use, display or publication;

Clause 9, Part 2:

(2) the NBA Statistics may only be used, displayed or published for legitimate news reporting or private, non-commercial purposes;

The third dataset from Kaggle, which I did not use, has a CC0: Public Domain license.

Consideration around implications of utilising data for purpose

There could be implications if the outcome of the analysis does not match what a fan or researcher expects. Nevertheless, stats alone do not determine who is the right player for MVP, as there are things players do on the court that cannot be recorded. An example would be drawing an offensive foul (a charge) on an opposing player to regain possession of the ball.

The purpose of my analysis is to present, from my point of view, the conclusion I reached as to who should be the MVP.

Considerations of the data processing pipeline

The stats are available for anyone to view, as long as one does not do anything illegal with them.

Any potential biases of the dataset

The stats change with each season, so they are accurate for now. Potential biases include a referee missing a score or overturning a score using an unfamiliar rule. An example is Manu Ginóbili scoring a three-pointer from a pass intended as an alley-oop: the ball went in, but the referees missed it and overturned the shot.

References

Candidates for the Regular Season MVP are in this article:

Source: https://www.nba.com/news/kia-mvp-ladder-april-15-edition

Extracting NBA stats

Source: https://github.com/rd11490/NBA_Tutorials/tree/master/finding_endpoints

Plotly

Source: https://plotly.com/python/